ci: cache target dir for CUDA codspeed bench (~7× faster build)#8256
Closed
joseph-isaacs wants to merge 1 commit into
Merging this PR will improve performance by 23.7%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.6 µs | 31.7 µs | +46.98% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 216.9 ns | +26.89% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[1024] | 336.9 ns | 278.6 ns | +20.94% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 213.2 µs | 177.1 µs | +20.41% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[2048] | 400.6 ns | 342.2 ns | +17.05% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 309.6 µs | 274.7 µs | +12.71% |
Comparing ci/cuda-bench-target-cache (a1c1818) with develop (d97d2bd)
The CUDA codspeed shards run on ephemeral GPU runners with no prebuilt AMI, so cargo's target/ dir is cold on every run. sccache (S3) already serves ~98% of compilation, but cargo still re-runs build scripts and re-links unchanged dependency crates.

Restore target/ with rust-cache (as publish-dry-runs.yml does), keyed per shard and only saved on develop so PRs restore from develop's cache. The runner's extras=s3-cache (as bench.yml uses) routes it to the same S3 bucket as sccache, so there is no separate GitHub Actions cache backend.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
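For orientation, here is a minimal sketch of how a per-shard, develop-only rust-cache step of the kind described in this commit message might look in a workflow. It is not the actual diff. The job name, the matrix values, the runner family in the runs-on label, and the cache key string are all hypothetical placeholders. Only extras=s3-cache, Swatinem/rust-cache, and the develop-only save policy come from the description.

```yaml
# Sketch only: job name, matrix values, runner family, and key string are hypothetical.
jobs:
  cuda-bench:
    strategy:
      matrix:
        shard: [1, 2]  # hypothetical shard ids
    # extras=s3-cache is the runner-spec option named in the description;
    # family=g5 is a placeholder for whichever GPU runner the repo uses.
    runs-on: runs-on=${{ github.run_id }}/family=g5/extras=s3-cache
    steps:
      - uses: actions/checkout@v4
      - uses: Swatinem/rust-cache@v2
        with:
          # One cache entry per shard, so shards do not overwrite each other.
          key: cuda-bench-shard-${{ matrix.shard }}
          # Save only on develop; PR runs restore the cache but never write it.
          save-if: ${{ github.ref == 'refs/heads/develop' }}
```

With save-if evaluating to false on pull requests, a PR run can only read whatever the most recent develop run stored, which is why the first develop run after merging has to populate each shard's cache before any speedup is visible.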
Problem

The CUDA codspeed bench shards run on ephemeral runs-on GPU runners with no prebuilt AMI (unlike the CPU shards, which use setup-prebuild on the -pre-v2 AMIs). cargo's target/ is cold on every run, so even though sccache (S3) serves ~98% of compilation, cargo still re-runs build scripts and re-links unchanged dependency crates.

What this does

Restore target/ between runs with Swatinem/rust-cache, keyed per shard and saved only on develop, so PRs restore from develop's cache without polluting it.

The cache is configured to match existing repo conventions as closely as possible:

- extras=s3-cache on the runner spec routes the cache to the same S3 bucket as sccache, the same mechanism bench.yml / bench-pr.yml use. No separate GitHub Actions cache backend.
- The Swatinem/rust-cache step mirrors its only other use in the repo (publish-dry-runs.yml): save-if: develop.

Net diff is CI-only: +12 / −1 in codspeed.yml.

Measurements (controlled experiment on the GPU runner)

[Results table not recovered from the page. The surviving fragments are a BUILD_SECONDS column and a row labelled "target/, sccache warm (codegen-units = 1)".]

The often-quoted ~500s figure was from before [profile.bench] codegen-units = 16 landed and before sccache was warm; the realistic cold build is ~2 min. This branch is rebased on develop so it builds with codegen-units = 16; the fresh run measures that realistic baseline. Since save-if is develop-only, this PR's own runs are cold (no develop cache exists yet); the warm benefit only appears after the first develop run populates the per-shard caches.

🤖 Generated with Claude Code